[fix](insert) fix InsertLoadJob memory leak caused by jobs permanently stuck in PENDING state#62282
[fix](insert) fix InsertLoadJob memory leak caused by jobs permanently stuck in PENDING state#62282sollhui wants to merge 1 commit intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
FE UT Coverage ReportIncrement line coverage |
|
run buildall |
|
/review |
There was a problem hiding this comment.
Found 1 issue.
-
Critical checkpoint: task goal and correctness
Conclusion: Partially met. The PR fixes localOlapInsertExecutorfailure cleanup and avoids registering some non-OLAP paths, but it does not fix the equivalent remote OLAP path, soInsertLoadJobleaks still remain for that execution path. No test covering remote OLAP success/failure registration behavior was added. -
Critical checkpoint: change size and focus
Conclusion: Mostly focused and small. The constructor opt-in for registration is a reasonable narrowing, but the fix is not consistently applied to all functionally parallel OLAP insert paths. -
Critical checkpoint: concurrency
Conclusion: No new concurrency or locking risk identified in the changed code. -
Critical checkpoint: lifecycle / static initialization
Conclusion: No special lifecycle or static initialization issue identified. -
Critical checkpoint: configuration
Conclusion: No configuration changes. -
Critical checkpoint: compatibility
Conclusion: No FE/BE protocol or storage compatibility issue identified from this patch. -
Critical checkpoint: functionally parallel paths
Conclusion: Not satisfied.RemoteOlapInsertExecutorremains inconsistent withOlapInsertExecutor; it still registers a local load job but does not finish that same local job on either success or failure. -
Critical checkpoint: special conditions / assumptions
Conclusion: The newgroup_commitTVF exclusion is clearly commented and justified. -
Critical checkpoint: test coverage
Conclusion: Insufficient for the bug being fixed. The patch needs coverage for remote OLAP insert success/failure so this leak does not regress again. -
Critical checkpoint: observability
Conclusion: Existing logs are probably adequate for debugging this area; no additional observability requirement stands out. -
Critical checkpoint: transaction / persistence
Conclusion: The modified code still relies onLoadManager.recordFinishedLoadJob(...)for final job persistence, but the remote path does not reach the correct localLoadManagerinstance, so persistence/completion is still incomplete there. -
Critical checkpoint: data writes / atomicity
Conclusion: No new data atomicity issue identified beyond the load-job lifecycle leak. -
Critical checkpoint: FE/BE variable passing
Conclusion: No new FE/BE variable propagation issue identified. -
Critical checkpoint: performance
Conclusion: No material performance concern identified in the changed code. -
Critical checkpoint: other issues
Conclusion: The remaining remote-path leak is the blocking issue.
...c/main/java/org/apache/doris/nereids/trees/plans/commands/insert/InsertIntoTableCommand.java
Outdated
Show resolved
Hide resolved
|
run buildall |
|
/review |
There was a problem hiding this comment.
Found 1 blocking issue.
fe/fe-core/src/main/java/org/apache/doris/nereids/trees/plans/commands/insert/OlapInsertExecutor.java: this patch changesOlapInsertExecutorto require a newneedRegisterconstructor parameter, but the other subclasses (OlapTxnInsertExecutor,OlapGroupCommitInsertExecutor, andRemoteOlapInsertExecutor) still call the removed 7-argumentsuper(...). FE does not compile in this state, so the fix is not complete.
Critical checkpoint conclusions:
- Goal and correctness: Partially met. The intended local OLAP failure-path fix is reasonable, but the patch does not build, so the task is not accomplished end-to-end.
- Scope and clarity: Small and focused overall, but the constructor refactor is incomplete across parallel OLAP executor paths.
- Concurrency: No new concurrency or lock-order issue observed in the touched code.
- Lifecycle and static init: No special lifecycle or static-initialization issue observed.
- Configuration: No new configuration items.
- Compatibility: No storage-format or FE/BE protocol compatibility change identified in the touched code.
- Parallel code paths: Not fully handled. Other
OlapInsertExecutorsubclasses still rely on the old constructor signature. - Special conditional checks: The new
group_commitTVF skip condition has a clear rationale comment. - Test coverage: No regression or unit test accompanies this insert-job lifecycle fix, so the failure and leak paths remain untested.
- Observability: Existing logging and error propagation are adequate for this change.
- Transaction and persistence: The intended
recordFinishedLoadJobusage follows existing persistence flow and would mark failed jobs asCANCELLED, but the compile break prevents end-to-end validation. - Data writes and atomicity: No new write-atomicity issue identified in the intended logic.
- FE/BE variable passing: Not applicable.
- Performance: No notable performance concern in the touched paths.
- Other issues: The constructor break is the primary blocker.
| long jobId) { | ||
| super(ctx, table, labelName, planner, insertCtx, emptyInsert, jobId); | ||
| long jobId, boolean needRegister) { | ||
| super(ctx, table, labelName, planner, insertCtx, emptyInsert, jobId, needRegister); |
There was a problem hiding this comment.
Changing OlapInsertExecutor to an 8-argument constructor breaks the subclasses that still call the old 7-argument super(...). In this patch, OlapTxnInsertExecutor (line 46), OlapGroupCommitInsertExecutor (line 69), and RemoteOlapInsertExecutor (line 81) still invoke the removed signature, so FE no longer compiles. Please either keep the old overload or update all subclasses in the same change.
fix #62118
Problem
InsertLoadJobis registered withLoadManagerfor all insert types during executorconstruction, but most executors never transition the job to a terminal state, causing
permanent PENDING leaks:
afterExecnever callsrecordFinishedLoadJobafterExecis emptyafterExecis skipped on failure —executeSingleInsertdoes earlyreturnin catch block before reachingafterExecafterExecFix
1. Introduce
needRegisterflag inAbstractInsertExecutorAdd
boolean needRegisterto the constructor to make job registration an explicitopt-in. A default 7-arg constructor sets
needRegister = false, so any executor thatdoes not pass the flag will not register. Only
OlapInsertExecutorandRemoteOlapInsertExecutorpassneedRegister = (jobId != -1), preserving existingbehavior for inline-table and group_commit TVF cases where
jobIdis already-1.This replaces the implicit pattern where
jobId = -1was used as a side-channel toskip registration, which was easy to miss when adding new executor types.
2. Fix OLAP insert failure leak
Extract job recording into
recordLoadJob(UserIdentity)and call it from both:afterExec—errMsgis empty → job transitions toFINISHEDonFail—errMsgis set → job transitions toCANCELLED